Add ESGF links and more simulations to v1 data #60
@chengzhuzhang This is ready for review. I added the ESGF links that had data available. Web rendering can be seen at https://portal.nersc.gov/cfs/e3sm/forsyth/data_docs_60/html/v1/WaterCycle/simulation_data/simulation_table.html
forsyth2 left a comment
@chengzhuzhang I added df0cfdb to begin the work of adding the large ensemble, but there's still a bit more to do on that, as described in this self-review.
Results from this commit can be seen at https://portal.nersc.gov/cfs/e3sm/forsyth/data_docs_60_try2/html/v1/WaterCycle/simulation_data/simulation_table.html.
@@ -0,0 +1,18 @@
# This will be a problem if these simulations are ever removed from the publication archives!
for i in $(seq 1 20); do
    hsi ln -s /home/projects/e3sm/www/publication-archives/pub_archive_E3SM_1_0_LE_historical_ens$i /home/projects/e3sm/www/WaterCycle/E3SMv1/LR/LE_historical_ens$i
HSI/HPSS adds a @ to the end of its symlinks, but that may just be a visual indicator. In any case, HPSS paths and data sizes aren't being displayed on https://portal.nersc.gov/cfs/e3sm/forsyth/data_docs_60_try2/html/v1/WaterCycle/simulation_data/simulation_table.html
Some of the other data sets are showing a size of 0, but these don't show a size at all, which makes me think the path isn't being found.
That said, they do show up in my output logs:
1 /home/projects/e3sm/www/WaterCycle/E3SMv1/LR/LE_historical_ens11
-----------------------
0 total 512-byte blocks, 0 Files (0 bytes)
So, it seems to read it as an empty path. I wonder if symlinks show zero size?
This one shows up as 0 in the table:
341850452 2 /home/projects/e3sm/www/WaterCycle/E3SMv1/HR/cori-haswell.20190513.F2010LRtunedHR.plus4K.noCNT.ne30_oECv3/
-----------------------
341850452 total 512-byte blocks, 2 Files (175,027,431,424 bytes)
So, it must be because 175x10^9 bytes (0.175x10^12) truncates to 0 TB. Indeed, this 113x10^12-byte one shows up as 113:
221651622324 820 /home/projects/e3sm/www/WaterCycle/E3SMv1/HR/20211021-maint-1.0-tro.A_WCYCLSSP585_CMIP6_HR.ne120_oRRS18v3_ICG.unc12-3rd-attempt/
-----------------------
221651622324 total 512-byte blocks, 820 Files (113,485,630,629,888 bytes)
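The truncation is easy to confirm from the block counts alone. A minimal sketch, assuming the table truncates to whole decimal terabytes:

```python
def blocks_to_tb(blocks_512: int) -> int:
    # hsi du reports 512-byte blocks; truncate to whole (decimal) terabytes.
    return blocks_512 * 512 // 10**12

# The two summaries quoted above:
assert blocks_to_tb(341850452) == 0       # 175,027,431,424 bytes -> 0 TB
assert blocks_to_tb(221651622324) == 113  # 113,485,630,629,888 bytes -> 113 TB
```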
@forsyth2 Could you double check the file size from /home/projects/e3sm/www/publication-archives/pub_archive_E3SM_1_0_LE_historical_ens$i? Hopefully there is no corruption during zstash archive or transfer.
@chengzhuzhang It's definitely an issue with the symlinks; I'm discussing with NERSC support. The original paths are fine, e.g.:
hsi du /home/projects/e3sm/www/publication-archives/pub_archive_E3SM_1_0_LE_historical_ens1
# 49970007900 95 /home/projects/e3sm/www/publication-archives/pub_archive_E3SM_1_0_LE_historical_ens1/
# -----------------------
# 49970007900 total 512-byte blocks, 95 Files (25,584,644,044,800 bytes)
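For scripting such checks, the summary line of that output can be parsed instead of read by eye. A minimal sketch, assuming the summary format shown in these excerpts:

```python
import re

def parse_hsi_du_summary(line: str) -> tuple[int, int, int]:
    # Matches "<blocks> total 512-byte blocks, <n> Files (<bytes> bytes)",
    # the summary format seen in the hsi du excerpts above.
    m = re.search(r"(\d+) total 512-byte blocks, (\d+) Files \(([\d,]+) bytes\)", line)
    if m is None:
        raise ValueError(f"unrecognized hsi du summary: {line!r}")
    blocks, files, nbytes = m.groups()
    return int(blocks), int(files), int(nbytes.replace(",", ""))

blocks, files, nbytes = parse_hsi_du_summary(
    "# 49970007900 total 512-byte blocks, 95 Files (25,584,644,044,800 bytes)")
assert nbytes == blocks * 512  # sanity check: byte count and block count agree
```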
utils/simulations_v1_water_cycle.csv (Outdated)
v1, WaterCycle, LR, DAMIP, 20190404.DECKv1b_H1_hist-GHG.ne30_oEC.edison, edison, , damip_hist-GHG, 1, none, ,
v1, WaterCycle, LR, DAMIP, 20190404.DECKv1b_H2_hist-GHG.ne30_oEC.edison, edison, , damip_hist-GHG, 2, none, ,
v1, WaterCycle, LR, DAMIP, 20190404.DECKv1b_H3_hist-GHG.ne30_oEC.edison, edison, , damip_hist-GHG, 3, none, ,
v1, WaterCycle, LR, LargeEnsemble, LE_historical_ens1, , , historical-large-ensemble, 1, none, ,
Is the large ensemble data available on ESGF? If so, what's the experiment name? I assume it's not historical-large-ensemble.
And actually, on that note, some of the other v1 data sets may be missing ESGF links simply because I guessed the experiment name wrong. (I'm not seeing a way to determine the experiment, or the ensemble number for that matter, from https://e3sm.atlassian.net/wiki/spaces/ED/pages/4495441922/V1+Simulation+backfill+WIP)
Yes, the v1 large ensemble data are available on ESGF in CMIP format. The experiment and ensemble names can be found here: https://github.com/E3SM-Project/datasm/blob/master/datasm/resources/v1_LE_dataset_spec.yaml. @TonyB9000 I think you documented the mapping between the LE native ensemble index and the CMIP ensemble (e.g., r1i2p2f1), but I forgot whether that was for v1 or v2. Could you help check?
Will do.
E3SM LE_archive_refactor.xlsb.xlsx
I think this was v1, since if it was v2 I would have had to distinguish them, but nothing in the naming indicated v1 or v2.
I'll keep poking around.
The directory "/p/user_pub/e3sm/archive/External/" holds 5 related subdirectories:
E3SMv1_LE
E3SMv1_LE_ext
E3SMv1_LE_ssp370
E3SMv2_LE
E3SMv2_LE_ssp370
The E3SMv2_LE has a file I created called "Arch_Translator_E3SMv2_LE", and it holds
Ensemble,Archive,Branch_time_in_parent
ens6,v2.LR.historical_0111,40150.0
ens7,v2.LR.historical_0121,43800.0
ens8,v2.LR.historical_0131,47450.0
ens9,v2.LR.historical_0141,51100.0
ens10,v2.LR.historical_0161,58400.0
ens11,v2.LR.historical_0171,62050.0
ens12,v2.LR.historical_0181,65700.0
ens13,v2.LR.historical_0191,69350.0
ens14,v2.LR.historical_0211,76650.0
ens15,v2.LR.historical_0221,80300.0
ens16,v2.LR.historical_0231,83950.0
ens17,v2.LR.historical_0241,87600.0
ens18,v2.LR.historical_0261,94900.0
ens19,v2.LR.historical_0271,98550.0
ens20,v2.LR.historical_0281,102200.0
ens21,v2.LR.historical_0291,105850.0
(Ensembles 1-5 are missing because they were created independently of the LE, as part of the v2 historical.)
Likewise, E3SMv2_LE_ssp370/ holds a file named "Arch_Translator_E3SMv2_LE_ssp370", and it holds:
Ensemble,Archive,Branch_time_in_parent
ens1,v2.LR.SSP370_0101,36500.0
ens6,v2.LR.SSP370_0111,40150.0
ens7,v2.LR.SSP370_0121,43800.0
ens8,v2.LR.SSP370_0131,47450.0
ens9,v2.LR.SSP370_0141,51100.0
ens2,v2.LR.SSP370_0151,54750.0
ens10,v2.LR.SSP370_0161,58400.0
ens11,v2.LR.SSP370_0171,62050.0
ens12,v2.LR.SSP370_0181,65700.0
ens13,v2.LR.SSP370_0191,69350.0
ens3,v2.LR.SSP370_0201,73000.0
ens14,v2.LR.SSP370_0211,76650.0
ens15,v2.LR.SSP370_0221,80300.0
ens16,v2.LR.SSP370_0231,83950.0
ens17,v2.LR.SSP370_0241,87600.0
ens4,v2.LR.SSP370_0251,91250.0
ens18,v2.LR.SSP370_0261,94900.0
ens19,v2.LR.SSP370_0271,98550.0
ens20,v2.LR.SSP370_0281,102200.0
ens21,v2.LR.SSP370_0291,105850.0
ens5,v2.LR.SSP370_0301,109500.0
I don't know how much that helps. Special functions were written that translate a given CMIP6 dataset_id to its corresponding E3SM "native" dataset_id. But for those functions to work (parent_native_dsid.sh, etc.), one must supply the alternate "Archive_Map" for the v1 or v2 LE, as these are not part of the E3SM "dataset_spec.yaml".
We can probably generate a "cmip-case" to "native-case" mapping file. Might take a day or so.
The historical cases and ssp370 cases are independent.
Walking the tree for "Project: E3SM" a bit further, you will see a "cmip_case", and yes, there is a 1-to-1 mapping between each native "ens#" and the corresponding "Project: CMIP" case, as in
"(native) ens#" corresponds to "(CMIP6) r#" of the variant label ("realization index").
E3SM:
  '1_0_LE':
    historical:
      start: 1850
      end: 2014
      ens:
        - ens1
        - ens2
        - ens3
        - ens4
        - ens5
        - ens6
        - ens7
        - ens8
        - ens9
        - ens10
        - ens11
        - ens12
        - ens13
        - ens14
        - ens15
        - ens16
        - ens17
        - ens18
        - ens19
        - ens20
      except:
        - TREFMNAV
        - TREFMXAV
      campaign: DECK-v1
      science_driver: Water Cycle
      cmip_case: CMIP6.CMIP.UCSB.E3SM-1-0.historical
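The ens#-to-r# correspondence and the Arch_Translator rows above can be combined into a mapping programmatically. A sketch, with the caveat that the i/p/f indices in the variant label are a placeholder guess here (the thread notes labels like r1i2p2f1 also occur):

```python
import csv
import io

def ensemble_to_variant(ens: str) -> str:
    # "ens6" -> "r6i1p1f1"; only the r# index comes from the ens# mapping,
    # the i1p1f1 suffix is an assumed default and may differ per dataset.
    return "r" + ens.removeprefix("ens") + "i1p1f1"

# A couple of rows shaped like the Arch_Translator files quoted above:
translator = (
    "Ensemble,Archive,Branch_time_in_parent\n"
    "ens6,v2.LR.historical_0111,40150.0\n"
    "ens7,v2.LR.historical_0121,43800.0\n"
)
mapping = {row["Archive"]: ensemble_to_variant(row["Ensemble"])
           for row in csv.DictReader(io.StringIO(translator))}
# mapping["v2.LR.historical_0111"] == "r6i1p1f1"
```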
If you put these lines into your (acme1) ~/.bashrc file:
export DSM_GETPATH=/p/user_pub/e3sm/staging/Relocation/.dsm_get_root_path.sh
alias list_e3sm="python /p/user_pub/e3sm/staging/tools/list_e3sm_dsids.py"
alias list_cmip="python /p/user_pub/e3sm/staging/tools/list_cmip6_dsids.py"
(and issue "source ~/.bashrc")
And then
1. git clone https://github.com/E3SM-Project/datasm.git
2. cd datasm
3. conda env create -n <env_name> -f conda-env/prod.yml
4. conda activate <env_name>
5. pip install .
Then your environment will have "datasm/util" and its functions available to any python, via "import datasm.util" or "from datasm.util import (selected functions)".
You can issue list_e3sm -d <path_to_the_dataset_spec> and generate ALL E3SM dataset_ids for that dataset_spec.
Likewise, use list_cmip -d <path_to_the_dataset_spec> to generate all corresponding CMIP6 dataset_ids.
These utilities "walk" the respective YAML trees to express every branch. If no "-d dataset_spec" is given, the default dataset_spec.yaml (staging/resource/dataset_spec.yaml) is used.
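The tree walk these utilities perform can be illustrated on a hand-built fragment of the spec (the dict shape below is assumed from the YAML excerpt earlier in the thread; the real tools read the full dataset_spec.yaml and emit full dataset_ids):

```python
# Toy fragment of the dataset_spec tree; shape assumed from the YAML excerpt.
spec = {
    "E3SM": {
        "1_0_LE": {
            "historical": {
                "ens": ["ens1", "ens2", "ens3"],
                "cmip_case": "CMIP6.CMIP.UCSB.E3SM-1-0.historical",
            },
        },
    },
}

def walk_branches(spec: dict):
    # Express every (model, experiment, ensemble) branch of the tree.
    for model, experiments in spec["E3SM"].items():
        for experiment, meta in experiments.items():
            for ens in meta["ens"]:
                yield f"{model}.{experiment}.{ens}"

branches = list(walk_branches(spec))
# branches[0] == "1_0_LE.historical.ens1"
```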
Other than this, I'm not quite sure what you need. Perhaps I can generate stuff for you, if I understand what you are looking for.
@TonyB9000 I'm just trying to determine the URL to link to. That requires knowing the query parameter values. But upon further inspection, it appears ESGF links might not be available for the large ensemble, in which case it's a moot point.

^Notice that none of the available experiment IDs suggest a large ensemble.
The v1 LE is published under project UCSB; if you leave out the Institution ID, the large ensemble should pop up.
True. But for the v2 LE the project listed is "E3SM-Project" (21 ensembles). The CMIP6 datasets are not really distinguished as "LE" except that the variant labels range from r6 to r21 (16 ensembles). The native data is distinguished by Model = "2_0_LE". Likewise, the v1 LE native data has Model = "1_0_LE" (but native data is no longer available via ESGF/Metagrid).
done

# Symlink last remaining large simulation
# This will be a problem if ndk ever deletes the source!
@chengzhuzhang I meant to include this in the self-review I just posted. The symlinks are fine as long as we are guaranteed that people don't delete the source directories like /home/projects/e3sm/www/publication-archives/ or /home/n/ndk/2019/theta.20190910.branch_noCNT.n825def.unc06.A_WCYCL1950S_CMIP6_HR.ne120_oRRS18v3_ICG. Is that something we can be sure of?
I think so. Tagging directory owners @TonyB9000 and @ndkeen: please make sure not to delete the above directories.
forsyth2 left a comment
@chengzhuzhang @TonyB9000 Ok I've added the large ensemble & the existing ESGF links. See https://portal.nersc.gov/cfs/e3sm/forsyth/data_docs_60_try9/html/v1/WaterCycle/simulation_data/simulation_table.html for a rendered version of the web page. This is ready for final review.
@TonyB9000 I've noted symlinked HPSS paths with (symlink) ...hpss_path...; is that going to interfere with any automated data retrieval you do from these pages?
@forsyth2 thanks for adding the v1 LE and the ESGF links. One note: for the simulation overview page, could you also add:
@forsyth2 @chengzhuzhang Yes, it most certainly will. A column labeled "HPSS Path" should not be polluted with non-functional commentary. People need to understand that we use computers to automate. As nice as it is to have human-friendly material, such should be secondary to functional considerations. Personally, I would have the default date-timestamp on ALL log-files be 8+ HEX chars (like "D85A33B2", representing Epoch-seconds). Very unfriendly to look at? Then pass it through a "prettifier" that converts the log entry to "2025-07-15 09:42:30", or if you like, "The Fifteenth Day of Our Lord, July 2025 AD, at the 9th hour, 42nd minute, and 30th second of the morning in the Pacific Standard Timezone". Instead, I will need to munge code to toss out everything in the returned HPSS-Path that occurs before the first "/". For now, at least.
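The munge described above (discarding everything before the first "/" in the HPSS-path cell) is small enough to sketch directly; clean_hpss_path is a hypothetical helper name:

```python
def clean_hpss_path(cell: str) -> str:
    # Toss everything before the first "/", e.g. a "(symlink) " annotation,
    # leaving only the functional HPSS path.
    idx = cell.find("/")
    return cell[idx:] if idx >= 0 else cell

assert clean_hpss_path(
    "(symlink) /home/projects/e3sm/www/WaterCycle/E3SMv1/LR/LE_historical_ens1"
) == "/home/projects/e3sm/www/WaterCycle/E3SMv1/LR/LE_historical_ens1"
```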
This page is for humans, though. It almost seems like we should have some sort of output file meant for a computer to read, rather than having a program parse the information from HTML. As I noted in a previous email:
That is, I believe the fundamental issue here is that we're relying on HTML to serve both computers and humans, when we should just be outputting computer-readable material elsewhere.
@TonyB9000 If you provide me with an exact list of the data you need from these tables, I should be able to easily produce that in a machine-readable file.
That would need to be part of a separate PR, though, as the work is distinct from adding the v1 data.
Indeed. In fact, to avoid inconsistencies, the focus should be to produce the "machine-readable" version of materials first, and then use that as the primary source for HTML creation and human-readable material, augmented with commentary, etc. Machine ==> Human: easy.
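A machine-first flow might look like the sketch below, with hypothetical field names; the point is only that the CSV and the HTML table are derived from the same rows, so they can never drift apart:

```python
import csv
import html
import io

# Hypothetical simulation records acting as the single source of truth.
rows = [
    {"simulation": "LE_historical_ens1",
     "hpss_path": "/home/projects/e3sm/www/WaterCycle/E3SMv1/LR/LE_historical_ens1"},
]

# 1. The machine-readable CSV is the primary artifact.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
machine_csv = buf.getvalue()

# 2. The human-facing HTML table is rendered from the same rows.
body = "".join(
    "<tr>" + "".join(f"<td>{html.escape(v)}</td>" for v in r.values()) + "</tr>"
    for r in rows)
html_table = f"<table>{body}</table>"
```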
@TonyB9000 Great, I prototyped a solution at #61. Can you please review #61 (review)? If you approve of that, I think I can go ahead and merge both PRs.
I should clarify. At runtime, I consult my own "NERSC_Archive_Locator" file, whose entries are (e.g.): This was created by MANUALLY scraping the HTML data. Note that the hyperlinks are removed; I don't use the first column.
At runtime (due to the magic of having created a "local Archive_Map" of paths to archives on Chrysalis AND zstash file extraction patterns), if I don't have the data in the warehouse BUT it is listed in the local Archive_Map, I take the "basename" of the archive path (the case_id, like "DECK,v2.NARRM.piControl") and look it up in the NERSC_Archive_Locator (field 2). Where a match is found, I return fields 3 (Volume) and 6 (NERSC HPSS Archive Path). I then (hope to) use "zstash --check" to pull over the archive in question.
Since I (presently) create the NERSC_Archive_Locator manually, I simply edit out extraneous material.
I'm a little confused. If the current process is manual, then what's the problem with having "(symlink) " in the HPSS path cell? In any case, #61 should pave the way to full automation as well.
@forsyth2 I guess "semi-manual", as I do use tools to strip formatting from the HTML copy. But yes, it is only a minor inconvenience. These things do add up (mapfile/region-file selection, user_metadata updates, etc.), so I am simply venting my frustrations with the system overall. Each of these little (manual) things are:
Hence, forcing automation not only eases the manual burden, it isolates decisions to a "fix-it-once-and-forget-it" regime of operation.
